71 research outputs found
Visualizing and Understanding Sum-Product Networks
Sum-Product Networks (SPNs) are recently introduced deep tractable
probabilistic models by which several kinds of inference queries can be
answered exactly and in tractable time. Up to now, they have largely been
used as black-box density estimators, assessed only by comparing their
likelihood scores. In this paper we explore and exploit the inner
representations learned by SPNs. We do this with a threefold aim: first we want
to get a better understanding of the inner workings of SPNs; secondly, we seek
additional ways to evaluate an SPN model and compare it against other
probabilistic models, providing diagnostic tools to practitioners; lastly, we
want to empirically evaluate how good and meaningful the extracted
representations are, as in a classic Representation Learning framework. In
order to do so, we revise their interpretation as deep neural networks and we
propose to exploit several visualization techniques on their node activations
and network outputs under different types of inference queries. To investigate
these models as feature extractors, we plug some SPNs, learned in a greedy
unsupervised fashion on image datasets, into supervised classification learning
tasks. We extract several embedding types from node activations by filtering
nodes by their type, by their associated feature abstraction level and by their
scope. In a thorough empirical comparison we prove them to be competitive
against those generated from popular feature extractors such as Restricted Boltzmann
Machines. Finally, we investigate embeddings generated from random
probabilistic marginal queries as a means to compare other tractable
probabilistic models on a common ground, extending our experiments to Mixtures
of Trees.
Comment: Machine Learning Journal paper (First Online), 24 pages.
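To make the embedding-extraction idea concrete, here is a minimal sketch (plain numpy, not the authors' code): a toy SPN is evaluated bottom-up and the activations of its sum nodes are collected as a feature vector; filtering by node type or scope amounts to choosing which activations get recorded. Structure and parameters are made up for illustration.

```python
import numpy as np

# Minimal sketch: evaluate a toy SPN bottom-up in log-space and collect
# sum-node activations as an embedding vector.

class Leaf:                                  # Bernoulli leaf over one variable
    def __init__(self, var, p_one):
        self.var, self.p_one = var, p_one
    def log_value(self, x, cache):
        return np.log(self.p_one if x[self.var] == 1 else 1.0 - self.p_one)

class Product:                               # product node: sum of child logs
    def __init__(self, children):
        self.children = children
    def log_value(self, x, cache):
        return sum(c.log_value(x, cache) for c in self.children)

class Sum:                                   # sum node: weighted log-sum-exp
    def __init__(self, children, weights):
        self.children, self.weights = children, np.asarray(weights)
    def log_value(self, x, cache):
        logs = np.array([c.log_value(x, cache) for c in self.children])
        v = np.logaddexp.reduce(np.log(self.weights) + logs)
        cache.append(v)                      # record this sum node's activation
        return v

def embed(root, x):
    """Return the vector of sum-node log-activations for input x."""
    cache = []
    root.log_value(x, cache)
    return np.array(cache)

# Toy structure: a mixture of two product distributions over X0, X1.
spn = Sum([Product([Leaf(0, 0.9), Leaf(1, 0.2)]),
           Product([Leaf(0, 0.1), Leaf(1, 0.8)])], weights=[0.6, 0.4])
print(embed(spn, np.array([1, 0])))          # embedding from the sum node(s)
```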
Automatic Bayesian Density Analysis
Making sense of a dataset in an automatic and unsupervised fashion is a
challenging problem in statistics and AI. Classical approaches for exploratory
data analysis are usually not flexible enough to deal with the uncertainty
inherent to real-world data: they are often restricted to fixed latent
interaction models and homogeneous likelihoods; they are sensitive to missing,
corrupt and anomalous data; moreover, their expressiveness generally comes at
the price of intractable inference. As a result, supervision from statisticians
is usually needed to find the right model for the data. However, since domain
experts are not necessarily also experts in statistics, we propose Automatic
Bayesian Density Analysis (ABDA) to make exploratory data analysis accessible
at large. Specifically, ABDA allows for automatic and efficient missing value
estimation, statistical data type and likelihood discovery, anomaly detection
and dependency structure mining, on top of providing accurate density
estimation. Extensive empirical evidence shows that ABDA is a suitable tool for
automatic exploratory analysis of mixed continuous and discrete tabular data.
Comment: In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19).
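The likelihood/type-discovery component can be illustrated with a hedged sketch: score candidate likelihood models on one column and normalise into a posterior over statistical types. Plug-in point estimates stand in for ABDA's fully Bayesian treatment here, and the candidate set is an illustrative assumption.

```python
import numpy as np
from scipy.stats import norm, poisson

# Hedged sketch of statistical-type discovery: compare candidate
# likelihoods on a single column and normalise the scores.

x = np.random.default_rng(0).poisson(3.0, size=200)   # a count-valued column

log_liks = {
    "gaussian": norm(loc=x.mean(), scale=x.std()).logpdf(x).sum(),
    "poisson":  poisson(mu=x.mean()).logpmf(x).sum(),
}
m = max(log_liks.values())
weights = {k: np.exp(v - m) for k, v in log_liks.items()}
total = sum(weights.values())
print({k: round(w / total, 3) for k, w in weights.items()})  # favours 'poisson'
```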
Taming the Sigmoid Bottleneck: Provably Argmaxable Sparse Multi-Label Classification
Sigmoid output layers are widely used in multi-label classification (MLC)
tasks, in which multiple labels can be assigned to any input. In many practical
MLC tasks, the number of possible labels is in the thousands, often exceeding
the number of input features and resulting in a low-rank output layer. In
multi-class classification, it is known that such a low-rank output layer is a
bottleneck that can result in unargmaxable classes: classes which cannot be
predicted for any input. In this paper, we show that for MLC tasks, the
analogous sigmoid bottleneck results in exponentially many unargmaxable label
combinations. We explain how to detect these unargmaxable outputs and
demonstrate their presence in three widely used MLC datasets. We then show that
they can be prevented in practice by introducing a Discrete Fourier Transform
(DFT) output layer, which guarantees that all sparse label combinations with up
to k active labels are argmaxable. Our DFT layer trains faster and is more
parameter efficient, matching the F1@k score of a sigmoid layer while using up
to 50% fewer trainable parameters. Our code is publicly available at
https://github.com/andreasgrv/sigmoid-bottleneck.
Comment: Published at AAAI 2024.
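As a hedged illustration of the contrast, the sketch below compares a trainable low-rank sigmoid layer with a fixed Fourier-feature label projection; the paper's precise DFT construction and its argmaxability guarantee live in the linked repository, and the sizes and basis here are assumptions made only for the example.

```python
import numpy as np

# Illustrative sizes/basis, not the paper's exact construction.

num_labels, d = 1000, 64                     # labels >> feature dimension
rng = np.random.default_rng(0)
h = rng.normal(size=d)                       # penultimate-layer features

# Standard sigmoid layer: trainable W of rank <= d < num_labels, so logits
# live in a d-dimensional subspace and most of the 2**num_labels sign
# patterns are unreachable (unargmaxable label combinations).
W = rng.normal(size=(num_labels, d))
print("sigmoid layer, active labels:", int((W @ h > 0).sum()))

# DFT-style layer: a *fixed* matrix of low-frequency Fourier features per
# label; only h (and the layers below it) is trained, so the output layer
# itself contributes no trainable parameters.
freqs = np.arange(1, d // 2 + 1)
pos = np.arange(num_labels)[:, None] / num_labels
F = np.concatenate([np.cos(2 * np.pi * pos * freqs),
                    np.sin(2 * np.pi * pos * freqs)], axis=1)   # (labels, d)
print("DFT layer, active labels:", int((F @ h > 0).sum()))
```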
Conditional Sum-Product Networks: Imposing Structure on Deep Probabilistic Architectures
Probabilistic graphical models are a central tool in AI; however, they are
generally not as expressive as deep neural models, and inference is notoriously
hard and slow. In contrast, deep probabilistic models such as sum-product
networks (SPNs) capture joint distributions in a tractable fashion, but still
lack the expressive power of intractable models based on deep neural networks.
Therefore, we introduce conditional SPNs (CSPNs), conditional density
estimators for multivariate and potentially hybrid domains which allow
harnessing the expressive power of neural networks while still maintaining
tractability guarantees. One way to implement CSPNs is to use an existing SPN
structure and condition its parameters on the input, e.g., via a deep neural
network. This approach, however, might misrepresent the conditional
independence structure present in data. Consequently, we also develop a
structure-learning approach that derives both the structure and parameters of
CSPNs from data. Our experimental evidence demonstrates that CSPNs are
competitive with other probabilistic models and yield superior performance on
multilabel image classification compared to mean field and mixture density
networks. Furthermore, they can successfully be employed as building blocks for
structured probabilistic models, such as autoregressive image models.
Comment: 13 pages, 6 figures.
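The first implementation route, conditioning the sum weights of a fixed structure on the input, can be sketched in a few lines. This is a toy construction, not the authors' code: a small neural network maps x to the weights of a one-sum-node "SPN" over two Gaussian leaves.

```python
import numpy as np

# Toy CSPN-style gating: a network produces input-dependent sum weights,
# yielding a tractable conditional density p(y | x).

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 3)), rng.normal(size=(2, 8))  # gating-net params

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def conditional_log_density(y, x, leaf_means):
    """log p(y | x): sum-node weights are produced by a network from x."""
    hidden = np.tanh(W1 @ x)
    weights = softmax(W2 @ hidden)            # input-dependent sum weights
    log_leaves = -0.5 * (y - leaf_means) ** 2 - 0.5 * np.log(2 * np.pi)
    return np.logaddexp.reduce(np.log(weights) + log_leaves)

x = np.array([0.5, -1.0, 2.0])
print(conditional_log_density(0.3, x, leaf_means=np.array([-1.0, 1.0])))
```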
PIXAR: Auto-Regressive Language Modeling in Pixel Space
Recent work showed the possibility of building open-vocabulary large language
models (LLMs) that directly operate on pixel representations. These models are
implemented as autoencoders that reconstruct masked patches of rendered text.
However, these pixel-based LLMs are limited to discriminative tasks (e.g.,
classification) and, similar to BERT, cannot be used to generate text.
Therefore, they cannot be used for generative tasks such as free-form question
answering. In this work, we introduce PIXAR, the first pixel-based
autoregressive LLM that performs text generation. Consisting of only a decoder,
PIXAR can perform free-form generative tasks while keeping the number of
parameters on par with previous encoder-decoder models. Furthermore, we
highlight the challenges of generating text as non-noisy images and show this
is due to using a maximum likelihood objective. To overcome this problem, we
propose an adversarial pretraining stage that improves the readability and
accuracy of PIXAR by 8.1 points on LAMBADA and 8.5 points on bAbI -- making it comparable to
GPT-2 on text generation tasks. This paves the way to build open-vocabulary
LLMs that operate on perceptual input only and calls into question the
necessity of the usual symbolic input representation, i.e., text as
(sub)tokens.
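As a hedged sketch of the pretraining signal (shapes and the linear one-step predictor are illustrative assumptions; PIXAR itself is a Transformer decoder), next-patch prediction under a Gaussian likelihood reduces MLE to mean-squared error, the objective the abstract links to noisy generations.

```python
import numpy as np

# Autoregressive next-patch prediction over rendered-text patches; under
# a Gaussian likelihood, MLE equals MSE up to constants.

rng = np.random.default_rng(0)
T, P = 16, 64                        # patches per sequence, pixels per patch
patches = rng.random((T, P))         # stand-in for rendered-text patches

W = rng.normal(size=(P, P)) * 0.01   # toy predictor of patch t+1 from patch t
pred = patches[:-1] @ W.T            # predictions for patches 1..T-1
mse = ((pred - patches[1:]) ** 2).mean()
print(f"next-patch MSE (Gaussian NLL up to constants): {mse:.4f}")
```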
Not All Neuro-Symbolic Concepts Are Created Equal: Analysis and Mitigation of Reasoning Shortcuts
Neuro-Symbolic (NeSy) predictive models hold the promise of improved
compliance with given constraints, systematic generalization, and
interpretability, as they allow inferring labels that are consistent with some
prior knowledge by reasoning over high-level concepts extracted from
sub-symbolic inputs. It was recently shown that NeSy predictors are affected by
reasoning shortcuts: they can attain high accuracy but by leveraging concepts
with unintended semantics, thus coming short of their promised advantages. Yet,
a systematic characterization of reasoning shortcuts and of potential
mitigation strategies is missing. This work fills this gap by characterizing
them as unintended optima of the learning objective and identifying four key
conditions behind their occurrence. Based on this, we derive several natural
mitigation strategies, and analyze their efficacy both theoretically and
empirically. Our analysis shows reasoning shortcuts are difficult to deal with,
casting doubts on the trustworthiness and interpretability of existing NeSy
solutions.
Comment: 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
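A toy construction (ours, not from the paper) makes the phenomenon concrete: if prior knowledge says y = c1 XOR c2, a predictor that flips both concepts fits every label perfectly while giving the concepts unintended semantics.

```python
from itertools import product

# Labels obey y = c1 XOR c2. Enumerate concept extractors and check
# which ones reproduce every label in the truth table.

data = [((a, b), a ^ b) for a, b in product([0, 1], repeat=2)]

maps = {"identity": lambda v: v, "flip": lambda v: 1 - v}

for n1, f in maps.items():
    for n2, g in maps.items():
        if all(f(a) ^ g(b) == y for (a, b), y in data):
            print(f"({n1}, {n2}) fits all labels")

# Output: both (identity, identity) and (flip, flip) fit. The latter is a
# reasoning shortcut: perfect label accuracy, unintended concept semantics.
```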
How to Turn Your Knowledge Graph Embeddings into Generative Models
Some of the most successful knowledge graph embedding (KGE) models for link
prediction -- CP, RESCAL, TuckER, ComplEx -- can be interpreted as energy-based
models. Under this perspective, they are not amenable to exact
maximum-likelihood estimation (MLE) or sampling, and they struggle to integrate
logical constraints. This work re-interprets the score functions of these KGEs as
circuits -- constrained computational graphs allowing efficient
marginalisation. Then, we design two recipes to obtain efficient generative
circuit models by either restricting their activations to be non-negative or
squaring their outputs. Our interpretation comes with little or no loss of
performance for link prediction, while the circuits framework unlocks exact
learning by MLE, efficient sampling of new triples, and the guarantee that logical
constraints are satisfied by design. Furthermore, our models scale more
gracefully than the original KGEs on graphs with millions of entities.
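The first recipe, restricting activations to be non-negative, can be sketched for a CP scorer (sizes and the softplus reparameterisation are illustrative assumptions): non-negative factors make the score a valid unnormalised probability whose partition function factorises, so exact normalisation is tractable.

```python
import numpy as np

# Non-negative CP scorer: the sum over all triples factorises, so the
# partition function costs O((E + R) * d) instead of O(E^2 * R * d).

rng = np.random.default_rng(0)
E, R, d = 100, 10, 16                           # entities, relations, rank

def softplus(x):
    return np.log1p(np.exp(x))

subj = softplus(rng.normal(size=(E, d)))        # non-negative subject factors
rel = softplus(rng.normal(size=(R, d)))         # non-negative relation factors
obj = softplus(rng.normal(size=(E, d)))         # non-negative object factors

def score(s, r, o):                             # CP: sum_i subj[s,i]*rel[r,i]*obj[o,i]
    return (subj[s] * rel[r] * obj[o]).sum()

# Circuit-style marginalisation: push the sums inside the product.
Z_fast = (subj.sum(0) * rel.sum(0) * obj.sum(0)).sum()

# Brute-force check over all E*R*E triples (feasible at toy scale only).
Z_slow = sum(score(s, r, o) for s in range(E) for r in range(R) for o in range(E))
print(np.isclose(Z_fast, Z_slow))               # True: p = score / Z is exact
```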
- …